Deep neural network context embeddings for model selection in rich-context HMM synthesis
Abstract
This paper introduces a novel form of parametric synthesis that uses context embeddings produced by the bottleneck layer of a deep neural network to guide the selection of models in a rich-context HMM-based synthesiser. Rich-context synthesis – in which Gaussian distributions estimated from single linguistic contexts seen in the training data are used for synthesis, rather than more conventional decision tree-tied models – was originally proposed to address over-smoothing due to averaging across contexts. Our previous investigations have confirmed experimentally that averaging across different contexts is indeed one of the largest factors contributing to the limited quality of statistical parametric speech synthesis. However, a possible weakness of the rich-context approach as previously formulated is that a conventional tied model is still used to guide the selection of Gaussians at synthesis time. Our proposed approach replaces this tied model with context embeddings derived from a neural network.
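The selection idea described in the abstract can be illustrated with a minimal sketch. The abstract does not state how embeddings are compared, so the Euclidean nearest-neighbour rule, the function names, the data layout, and the toy vectors below are all assumptions made for illustration, not the authors' method.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_rich_context_model(target_embedding, candidates):
    """Return the id of the candidate whose embedding is nearest the target.

    candidates: list of (model_id, embedding) pairs, one per rich-context
    Gaussian seen in training (hypothetical layout, assumed for this sketch).
    """
    return min(candidates, key=lambda c: euclidean(target_embedding, c[1]))[0]


# Toy 3-dimensional "bottleneck" embeddings, made up for illustration only.
candidates = [
    ("ctx_a", [0.0, 1.0, 0.0]),
    ("ctx_b", [0.9, 0.1, 0.2]),
    ("ctx_c", [0.5, 0.5, 0.5]),
]
target = [1.0, 0.0, 0.0]  # embedding of the context to be synthesised
best = select_rich_context_model(target, candidates)
```

In a real system the embeddings would come from the network's bottleneck layer, and the candidate set would likely be restricted (for example, to models of the same phone and state) before the search.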
Similar resources
Phonetic Context Embeddings for DNN-HMM Phone Recognition
This paper proposes an approach, named phonetic context embedding, to model phonetic context effects for deep neural network hidden Markov model (DNN-HMM) phone recognition. Phonetic context embeddings can be regarded as continuous and distributed vector representations of context-dependent phonetic units (e.g., triphones). In this work they are computed using neural networks. First, all phone ...
The USTC System for Blizzard Challenge 2016
This paper introduces the details of the speech synthesis entry developed by the USTC team for Blizzard Challenge 2016. A 5-hour corpus of highly expressive children’s audiobook was released this year to the participants. A hidden Markov model (HMM)-based unit selection system was built for the task. In addition, we utilized deep neural networks to improve the performance of our system, in bot...
Acoustic Modeling in Statistical Parametric Speech Synthesis – from HMM to LSTM-RNN
Statistical parametric speech synthesis (SPSS) combines an acoustic model and a vocoder to render speech given a text. Typically decision tree-clustered context-dependent hidden Markov models (HMMs) are employed as the acoustic model, which represent a relationship between linguistic and acoustic features. Recently, artificial neural network-based acoustic models, such as deep neural networks, ...
An investigation of context clustering for statistical speech synthesis with deep neural network
The state-of-the-art DNN speech synthesis system directly maps linguistic input to acoustic output, and voice quality improvement over the conventional MSD-GMM-HMM synthesis system has been reported. A DNN-based speech synthesis system does not require context clustering as GMM-HMM systems do, and this was believed to be the main advantage and contributor to performance improvement. Our previous wo...
Convolutional neural network with adaptive windows for speech recognition
Although speech recognition systems are widely used and their accuracy is continuously increasing, there is a considerable gap between their performance and human recognition ability. This is partially due to high speaker variation in the speech signal. Deep neural networks are among the best tools for acoustic modeling. Recently, using hybrid deep neural network and hidden Markov mo...